This report covers initial data cleaning and preliminary exploratory analysis of a text corpus drawn from three sources: Twitter, news websites, and blogs. The data were retrieved from the Data Science Capstone project on Coursera, as part of the final project for the specialization. After cleaning, the three corpora contain roughly 4 million lines of sentences and 60 million words. Sampling 20% from each corpus, we summarize the most common words and phrases of up to four words, and plot them below.
First, I tidy the text: I remove non-ASCII characters, convert everything to lowercase, and drop "stop words" (very common words such as pronouns, "the", etc.) to reduce the computational burden in the machine learning step. I then remove text emoticons and all punctuation except apostrophes.
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [5] "Words from a complete stranger! Made my birthday even better :)"
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
## sentence
## 1: btw thanks rt gonna dc anytime soon love see way way long
## 2: meet someone special know heart beat rapidly smile reason
## 3: decided fun
## 4: tired played lazer tag ran lot ughh going sleep like 5 minutes
## 5: words complete stranger made birthday even better
## 6: first cubs game ever wrigley field gorgeous perfect go cubs go
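The cleaning steps above can be sketched as a small function. This is a minimal illustration, not the report's actual code: `clean_corpus` and its emoticon pattern are assumptions, and the `stop_words` vector stands in for whatever stop-word list was used.

```r
# Sketch of the cleaning pipeline: drop non-ASCII, lowercase, strip emoticons,
# keep only letters/digits/apostrophes, then remove stop words.
# (Illustrative only; names and the emoticon regex are assumptions.)
clean_corpus <- function(raw_text, stop_words) {
  txt <- iconv(raw_text, to = "ASCII", sub = "")   # remove non-ASCII characters
  txt <- tolower(txt)                              # lowercase everything
  txt <- gsub("[;:=8][-~]?[()dp]", " ", txt)       # strip common text emoticons
  txt <- gsub("[^a-z0-9' ]", " ", txt)             # punctuation except apostrophe
  txt <- gsub("\\s+", " ", trimws(txt))            # collapse whitespace
  words <- strsplit(txt, " ")
  vapply(words,
         function(w) paste(w[!w %in% stop_words], collapse = " "),
         character(1))
}
```

For example, `clean_corpus("Go Cubs Go!", character(0))` yields `"go cubs go"`, matching the before/after pairs shown above.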
Preliminary analysis includes finding the top 10 most common phrases in each corpus and a summary of each corpus, including line and word counts. Note that these summaries describe the data set after cleaning and may not reflect the original data. Since this is the set that will be used for training, it is the one we analyze.
## [1] "tidyblog data set has a total of 19751975 words and a total of 897371 lines"
## [1] "tidytwitter data set has a total of 17530938 words and a total of 2352326 lines"
## [1] "tidynews data set has a total of 20936984 words and a total of 1009854 lines"
After some tinkering, I decided to use 20% of each corpus for the training data set; I tried to build the n-gram tables from the full data set, but my desktop could not handle it. I combined the samples from each corpus into a new data set, then split each sentence into single words and short phrases (n-grams) to count the most frequently used words and phrases.
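The sampling and counting step can be sketched with `tidytext::unnest_tokens`. This is an assumed reconstruction: the placeholder vectors stand in for the cleaned corpora, and `sample20`/`count_ngrams` are illustrative names, not the report's code.

```r
library(dplyr)
library(tidytext)

# Placeholder corpora standing in for the cleaned tidyblog / tidytwitter /
# tidynews character vectors (one cleaned sentence per element).
tidyblog    <- c("first cubs game ever", "words complete stranger")
tidytwitter <- c("btw thanks rt gonna dc anytime soon")
tidynews    <- c("new york city")

set.seed(42)
# Draw 20% of each corpus (at least one line, for this tiny example)
sample20 <- function(x) sample(x, size = max(1, floor(0.2 * length(x))))

train <- tibble(sentence = c(sample20(tidyblog),
                             sample20(tidytwitter),
                             sample20(tidynews)))

# Tokenize sentences into n-grams and count them, most frequent first
count_ngrams <- function(df, n) {
  df %>%
    unnest_tokens(ngram, sentence, token = "ngrams", n = n) %>%
    count(ngram, sort = TRUE)
}

head(count_ngrams(train, 1), 10)  # top unigrams; use n = 2..4 for phrases
```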
## ngram n
## 1: one 25238
## 2: just 20031
## 3: like 19713
## 4: can 19699
## 5: time 18335
## 6: get 14237
## 7: now 12061
## 8: know 11976
## 9: people 11963
## 10: new 11107
## ngram n
## 1: right now 1036
## 2: years ago 995
## 3: even though 990
## 4: new york 979
## 5: year old 908
## 6: first time 886
## 7: 1 2 875
## 8: feel like 874
## 9: u s 857
## 10: can see 843
## ngram n
## 1: new york city 169
## 2: 1 2 cup 150
## 3: new york times 135
## 4: couple weeks ago 97
## 5: amazon services llc 96
## 6: llc amazon eu 96
## 7: services llc amazon 96
## 8: 1 4 cup 92
## 9: 1 1 2 89
## 10: new york n 70
## ngram n
## 1: amazon services llc amazon 96
## 2: services llc amazon eu 96
## 3: new york n y 68
## 4: backgroun none repeat scroll 55
## 5: none repeat scroll 0 55
## 6: repeat scroll 0 0 55
## 7: scroll 0 0 yello 55
## 8: style backgroun none repeat 55
## 9: 0 0 yello class 53
## 10: advertising fees advertising linking 48
For the machine learning algorithm, I plan to use a Markov chain model from the markovchain package, with a 1- to 4-gram model and stupid backoff for text prediction. After evaluation on a validation data set, I will save the model for later use and load it into a Shiny app to serve predictions.
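The planned stupid-backoff lookup can be sketched as follows. This is only an outline under assumptions, not the final model: it assumes `ngrams[[n]]` is a data frame of n-gram counts with columns `prefix` (the first n-1 words), `word` (the predicted word), and `n` (the count); `alpha = 0.4` is the conventional stupid-backoff discount.

```r
# Sketch of stupid backoff: try the longest matching context first, and back
# off to shorter n-gram tables when the prefix is unseen. Returns the single
# highest-count continuation (a full scorer would rank candidates by
# count * alpha^(levels backed off)).
predict_next <- function(ngrams, context, alpha = 0.4) {
  words <- strsplit(context, " ")[[1]]
  for (n in rev(seq_along(ngrams))) {   # highest-order table first
    k <- n - 1                          # prefix length for an n-gram
    if (length(words) < k) next
    prefix <- paste(tail(words, k), collapse = " ")
    hits <- ngrams[[n]][ngrams[[n]]$prefix == prefix, ]
    if (nrow(hits) > 0) {
      return(hits$word[which.max(hits$n)])
    }
  }
  NA_character_                         # no table matched at any order
}
```

With a bigram table containing `prefix = "new", word = "york"`, for instance, `predict_next(ngrams, "brand new")` would return `"york"`, and an unseen context would fall back to the most frequent unigram.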